70 research outputs found
On pairwise distances and median score of three genomes under DCJ
In comparative genomics, the rearrangement distance between two genomes
(equal the minimal number of genome rearrangements required to transform them
into a single genome) is often used for measuring their evolutionary
remoteness. Generalization of this measure to three genomes is known as the
median score (while a resulting genome is called median genome). In contrast to
the rearrangement distance between two genomes which can be computed in linear
time, computing the median score for three genomes is NP-hard. This inspires a
quest for simpler and faster approximations for the median score, the most
natural of which appears to be the halved sum of pairwise distances which in
fact represents a lower bound for the median score.
In this work, we study relationship and interplay of pairwise distances
between three genomes and their median score under the model of
Double-Cut-and-Join (DCJ) rearrangements. Most remarkably we show that while a
rearrangement may change the sum of pairwise distances by at most 2 (and thus
change the lower bound by at most 1), even the most "powerful" rearrangements
in this respect that increase the lower bound by 1 (by moving one genome
farther away from each of the other two genomes), which we call strong, do not
necessarily affect the median score. This observation implies that the two
measures are not as well-correlated as one's intuition may suggest.
We further prove that the median score attains the lower bound exactly on the
triples of genomes that can be obtained from a single genome with strong
rearrangements. While the sum of pairwise distances with the factor 2/3
represents an upper bound for the median score, its tightness remains unclear.
Nonetheless, we show that the difference of the median score and its lower
bound is not bounded by a constant.Comment: Proceedings of the 10-th Annual RECOMB Satellite Workshop on
Comparative Genomics (RECOMB-CG), 2012. (to appear
A Computational Method for the Rate Estimation of Evolutionary Transpositions
Genome rearrangements are evolutionary events that shuffle genomic
architectures. Most frequent genome rearrangements are reversals,
translocations, fusions, and fissions. While there are some more complex genome
rearrangements such as transpositions, they are rarely observed and believed to
constitute only a small fraction of genome rearrangements happening in the
course of evolution. The analysis of transpositions is further obfuscated by
intractability of the underlying computational problems.
We propose a computational method for estimating the rate of transpositions
in evolutionary scenarios between genomes. We applied our method to a set of
mammalian genomes and estimated the transpositions rate in mammalian evolution
to be around 0.26.Comment: Proceedings of the 3rd International Work-Conference on
Bioinformatics and Biomedical Engineering (IWBBIO), 2015. (to appear
Limited Lifespan of Fragile Regions in Mammalian Evolution
An important question in genome evolution is whether there exist fragile
regions (rearrangement hotspots) where chromosomal rearrangements are happening
over and over again. Although nearly all recent studies supported the existence
of fragile regions in mammalian genomes, the most comprehensive phylogenomic
study of mammals (Ma et al. (2006) Genome Research 16, 1557-1565) raised some
doubts about their existence. We demonstrate that fragile regions are subject
to a "birth and death" process, implying that fragility has limited
evolutionary lifespan. This finding implies that fragile regions migrate to
different locations in different mammals, explaining why there exist only a few
chromosomal breakpoints shared between different lineages. The birth and death
of fragile regions phenomenon reinforces the hypothesis that rearrangements are
promoted by matching segmental duplications and suggests putative locations of
the currently active fragile regions in the human genome
Genome aliquoting with double cut and join
<p>Abstract</p> <p>Background</p> <p>The <it>genome aliquoting probem </it>is, given an observed genome <it>A </it>with <it>n </it>copies of each gene, presumed to descend from an <it>n</it>-way polyploidization event from an ordinary diploid genome <it>B</it>, followed by a history of chromosomal rearrangements, to reconstruct the identity of the original genome <it>B'</it>. The idea is to construct <it>B'</it>, containing exactly one copy of each gene, so as to minimize the number of rearrangements <it>d</it>(<it>A, B' </it>⊕ <it>B' </it>⊕ ... ⊕ <it>B'</it>) necessary to convert the observed genome <it>B' </it>⊕ <it>B' </it>⊕ ... ⊕ <it>B' </it>into <it>A</it>.</p> <p>Results</p> <p>In this paper we make the first attempt to define and solve the genome aliquoting problem. We present a heuristic algorithm for the problem as well the data from our experiments demonstrating its validity.</p> <p>Conclusion</p> <p>The heuristic performs well, consistently giving a non-trivial result. The question as to the existence or non-existence of an exact solution to this problem remains open.</p
A Unifying Model of Genome Evolution Under Parsimony
We present a data structure called a history graph that offers a practical
basis for the analysis of genome evolution. It conceptually simplifies the
study of parsimonious evolutionary histories by representing both substitutions
and double cut and join (DCJ) rearrangements in the presence of duplications.
The problem of constructing parsimonious history graphs thus subsumes related
maximum parsimony problems in the fields of phylogenetic reconstruction and
genome rearrangement. We show that tractable functions can be used to define
upper and lower bounds on the minimum number of substitutions and DCJ
rearrangements needed to explain any history graph. These bounds become tight
for a special type of unambiguous history graph called an ancestral variation
graph (AVG), which constrains in its combinatorial structure the number of
operations required. We finally demonstrate that for a given history graph ,
a finite set of AVGs describe all parsimonious interpretations of , and this
set can be explored with a few sampling moves.Comment: 52 pages, 24 figure
Efficient algorithms for analyzing segmental duplications with deletions and inversions in genomes
Background: Segmental duplications, or low-copy repeats, are common in mammalian genomes. In the human genome, most segmental duplications are mosaics comprised of multiple duplicated fragments. This complex genomic organization complicates analysis of the evolutionary history of these sequences. One model proposed to explain this mosaic patterns is a model of repeated aggregation and subsequent duplication of genomic sequences. Results: We describe a polynomial-time exact algorithm to compute duplication distance, a genomic distance defined as the most parsimonious way to build a target string by repeatedly copying substrings of a fixed source string. This distance models the process of repeated aggregation and duplication. We also describe extensions of this distance to include certain types of substring deletions and inversions. Finally, we provide an description of a sequence of duplication events as a context-free grammar (CFG). Conclusion: These new genomic distances will permit more biologically realistic analyses of segmental duplications in genomes.
Sampling and counting genome rearrangement scenarios
Even for moderate size inputs, there are a tremendous number of optimal rearrangement scenarios, regardless what the model is and which specific question is to be answered. Therefore giving one optimal solution might be misleading and cannot be used for statistical inferring. Statistically well funded methods are necessary to sample uniformly from the solution space and then a small number of samples are sufficient for statistical inferring
Comparative Analysis of DNA Replication Timing Reveals Conserved Large-Scale Chromosomal Architecture
Recent evidence suggests that the timing of DNA replication is coordinated across megabase-scale domains in metazoan genomes, yet the importance of this aspect of genome organization is unclear. Here we show that replication timing is remarkably conserved between human and mouse, uncovering large regions that may have been governed by similar replication dynamics since these species have diverged. This conservation is both tissue-specific and independent of the genomic G+C content conservation. Moreover, we show that time of replication is globally conserved despite numerous large-scale genome rearrangements. We systematically identify rearrangement fusion points and demonstrate that replication time can be locally diverged at these loci. Conversely, rearrangements are shown to be correlated with early replication and physical chromosomal proximity. These results suggest that large chromosomal domains of coordinated replication are shuffled by evolution while conserving the large-scale nuclear architecture of the genome
Reconstructing cancer genomes from paired-end sequencing data
<p>Abstract</p> <p>Background</p> <p>A cancer genome is derived from the germline genome through a series of somatic mutations. Somatic structural variants - including duplications, deletions, inversions, translocations, and other rearrangements - result in a cancer genome that is a scrambling of intervals, or "blocks" of the germline genome sequence. We present an efficient algorithm for reconstructing the block organization of a cancer genome from paired-end DNA sequencing data.</p> <p>Results</p> <p>By aligning paired reads from a cancer genome - and a matched germline genome, if available - to the human reference genome, we derive: (i) a partition of the reference genome into intervals; (ii) adjacencies between these intervals in the cancer genome; (iii) an estimated copy number for each interval. We formulate the Copy Number and Adjacency Genome Reconstruction Problem of determining the cancer genome as a sequence of the derived intervals that is consistent with the measured adjacencies and copy numbers. We design an efficient algorithm, called Paired-end Reconstruction of Genome Organization (PREGO), to solve this problem by reducing it to an optimization problem on an interval-adjacency graph constructed from the data. The solution to the optimization problem results in an Eulerian graph, containing an alternating Eulerian tour that corresponds to a cancer genome that is consistent with the sequencing data. We apply our algorithm to five ovarian cancer genomes that were sequenced as part of The Cancer Genome Atlas. We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. non-reciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles.</p> <p>Conclusions</p> <p>We demonstrate that PREGO efficiently identifies complex and biologically relevant rearrangements in cancer genome sequencing data. An implementation of the PREGO algorithm is available at <url>http://compbio.cs.brown.edu/software/</url>.</p
progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement
Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms.We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss. We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46Mbp conserved among all taxa and a pan-genome of 15.2Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss. Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriacae may exhibit regulatory divergence.The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve
- …